In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.8.0)
visdat (version: 0.5.3)
JointAI (version: 0.6.1)
VIM (version: 5.1.1)
plyr (version: 1.8.6)
corrplot (version: 0.84)
ggplot2 (version: 3.3.0)

You can find the help file for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
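If some of these packages are not installed yet, they can be obtained from CRAN; a minimal sketch (the package names are taken from the list above):

```r
# install those of the required packages that are not yet available
pkgs <- c("mice", "visdat", "JointAI", "VIM", "plyr", "corrplot", "ggplot2")
install.packages(setdiff(pkgs, rownames(installed.packages())))
```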
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.
To load this dataset into R, you can use load(file.choose()); file.choose() opens a file browser and lets you navigate to the location of the file on your computer.
If you know the path to the file, you can also use load("<path>/EP16dat1.RData").
Take a first look at the data. Useful functions are dim(), head(), str() and summary().
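The output below can be reproduced with calls along these lines (assuming EP16dat1.RData has been downloaded to the working directory):

```r
load("EP16dat1.RData")   # creates the object EP16dat1
dim(EP16dat1)            # number of rows and columns
head(EP16dat1)           # first six rows
str(EP16dat1)            # class and first few values of each variable
summary(EP16dat1)        # summary statistics and frequency tables
```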
## [1] 1000 24
## HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP wgt
## 858 1.58 Other Hispanic 56.667 0.5 never no female 64.5 4.29 <NA> <=1 104.67 46.266
## 103 1.24 other 81.333 0.9 never no male 81.4 4.27 <NA> <NA> 124.67 63.503
## 338 1.71 Non-Hispanic White 70.000 0.7 current no male 76.0 4.22 <NA> <NA> 133.33 62.142
## 344 1.01 Non-Hispanic White 66.000 0.4 never no female 139.5 3.96 <NA> <=1 141.33 113.852
## 2382 1.09 Non-Hispanic White 69.333 0.9 current no male 94.6 4.97 <NA> 3-7 117.33 102.058
## 2479 NA Non-Hispanic White 71.333 NA former no female 101.8 NA yes 0 148.00 70.307
## hypten cohort occup age educ albu creat uricacid BMI hypchol hgt
## 858 no 2011 working 22 some college 3.8 0.61 3.6 19.273 no 1.5494
## 103 no 2011 working 39 College or above 4.3 0.87 5.4 19.526 no 1.8034
## 338 no 2011 working 51 College or above 4.3 0.89 6.2 22.112 no 1.6764
## 344 yes 2011 working 45 College or above 3.6 0.61 4.3 41.768 no 1.6510
## 2382 no 2011 working 31 High school graduate 4.9 0.83 6.1 32.284 no 1.7780
## 2479 yes 2011 not working 75 Less than 9th grade NA NA NA 24.276 <NA> 1.7018
## 'data.frame': 1000 obs. of 24 variables:
## $ HDL : num 1.58 1.24 1.71 1.01 1.09 NA NA 1.16 1.16 1.5 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 2 5 3 3 3 3 2 3 1 3 ...
## $ DBP : num 56.7 81.3 70 66 69.3 ...
## $ bili : num 0.5 0.9 0.7 0.4 0.9 NA NA 0.9 0.6 0.9 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 3 2 1 2 2 3 ...
## $ DM : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ gender : Factor w/ 2 levels "male","female": 2 1 1 2 1 2 1 1 1 1 ...
## $ WC : num 64.5 81.4 76 139.5 94.6 ...
## $ chol : num 4.29 4.27 4.22 3.96 4.97 NA NA 5.2 5.56 4.86 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA 3 3 NA NA NA ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 2 NA NA 2 4 1 1 5 3 4 ...
## $ SBP : num 105 125 133 141 117 ...
## $ wgt : num 46.3 63.5 62.1 113.9 102.1 ...
## $ hypten : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 1 1 1 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 1 3 3 1 1 1 ...
## $ age : num 22 39 51 45 31 75 73 52 29 40 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 4 5 5 5 3 1 2 4 3 1 ...
## $ albu : num 3.8 4.3 4.3 3.6 4.9 NA NA 3.9 4.9 4.3 ...
## $ creat : num 0.61 0.87 0.89 0.61 0.83 NA NA 0.91 0.93 0.94 ...
## $ uricacid: num 3.6 5.4 6.2 4.3 6.1 NA NA 7.8 3.9 4.9 ...
## $ BMI : num 19.3 19.5 22.1 41.8 32.3 ...
## $ hypchol : Factor w/ 2 levels "no","yes": 1 1 1 1 1 NA NA 1 1 1 ...
## $ hgt : num 1.55 1.8 1.68 1.65 1.78 ...
## HDL race DBP bili smoke DM
## Min. :0.36 Mexican American :112 Min. : 12.7 Min. :0.200 never :608 no :853
## 1st Qu.:1.11 Other Hispanic :110 1st Qu.: 64.0 1st Qu.:0.600 former :191 yes:147
## Median :1.32 Non-Hispanic White:350 Median : 70.7 Median :0.700 current:199
## Mean :1.39 Non-Hispanic Black:229 Mean : 70.8 Mean :0.753 NA's : 2
## 3rd Qu.:1.58 other :199 3rd Qu.: 77.3 3rd Qu.:0.900
## Max. :4.03 Max. :130.0 Max. :2.900
## NA's :86 NA's :59 NA's :95
## gender WC chol HyperMed alc SBP
## male :493 Min. : 61.9 Min. : 2.07 no : 25 0 :113 Min. : 81.3
## female:507 1st Qu.: 85.0 1st Qu.: 4.27 previous: 20 <=1 :281 1st Qu.:109.3
## Median : 95.1 Median : 4.91 yes :142 1-3 :105 Median :118.0
## Mean : 96.4 Mean : 5.00 NA's :813 3-7 : 81 Mean :120.2
## 3rd Qu.:105.5 3rd Qu.: 5.61 >7 :101 3rd Qu.:128.7
## Max. :154.7 Max. :10.68 NA's:319 Max. :202.0
## NA's :49 NA's :86 NA's :59
## wgt hypten cohort occup age
## Min. : 39.0 no :693 Length:1000 working :544 Min. :20.0
## 1st Qu.: 63.5 yes :265 Class :character looking for work: 46 1st Qu.:31.0
## Median : 76.9 NA's: 42 Mode :character not working :393 Median :43.0
## Mean : 78.3 NA's : 17 Mean :45.2
## 3rd Qu.: 89.1 3rd Qu.:59.0
## Max. :167.8 Max. :79.0
## NA's :22
## educ albu creat uricacid BMI hypchol
## Less than 9th grade : 83 Min. :3.00 Min. :0.440 Min. :2.30 Min. :15.3 no :813
## 9-11th grade :133 1st Qu.:4.10 1st Qu.:0.690 1st Qu.:4.40 1st Qu.:23.2 yes :101
## High school graduate:228 Median :4.30 Median :0.820 Median :5.30 Median :26.6 NA's: 86
## some college :283 Mean :4.29 Mean :0.852 Mean :5.36 Mean :27.5
## College or above :272 3rd Qu.:4.50 3rd Qu.:0.950 3rd Qu.:6.20 3rd Qu.:30.7
## NA's : 1 Max. :5.40 Max. :7.460 Max. :9.90 Max. :60.5
## NA's :91 NA's :91 NA's :92 NA's :37
## hgt
## Min. :1.40
## 1st Qu.:1.63
## Median :1.68
## Mean :1.68
## 3rd Qu.:1.75
## Max. :1.98
## NA's :22
It is important to check that all variables are coded correctly, i.e., have the correct class. Imputation software (e.g., the mice package) uses the class to automatically select imputation methods. When importing data from other software, it can happen that factors become continuous variables or that ordered factors lose their ordering.
str() (in the solution above) showed that in the EP16dat1 data the variables smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.
Using levels(EP16dat1$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.
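A minimal sketch of this conversion (assuming EP16dat1 is loaded):

```r
levels(EP16dat1$educ)                      # print the names of the categories
EP16dat1$educ <- as.ordered(EP16dat1$educ) # convert to an ordered factor
is.ordered(EP16dat1$educ)                  # TRUE if the conversion worked
str(EP16dat1$educ)                         # "Ord.factor" confirms the ordering
```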
In the summary() we could already see that there are missing values in several variables. The missing data pattern can be obtained and visualized by several functions from different packages. Examples are
md.pattern() from package mice
md_pattern() from package JointAI (with argument pattern = TRUE)
aggr() from package VIM
vis_dat() and vis_miss() from package visdat

Explore the missing data pattern of the EP16dat1 data.
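The pattern printed below was obtained with md.pattern(); the other functions listed above work analogously:

```r
library(mice)
md.pattern(EP16dat1)  # prints the pattern matrix and plots it
```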
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 527 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 159 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## 11 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## creat uricacid bili alc HyperMed
## 33 1 1 1 1 1 0
## 527 1 1 1 1 0 1
## 75 1 1 1 0 1 1
## 159 1 1 1 0 0 2
## 2 1 1 0 1 0 2
## 1 1 1 0 0 0 3
## 1 1 0 0 1 0 3
## 3 0 0 0 1 0 5
## 2 0 0 0 0 1 5
## 15 0 0 0 1 1 7
## 21 0 0 0 1 0 8
## 8 0 0 0 0 1 8
## 12 0 0 0 0 0 9
## 10 1 1 1 1 1 2
## 3 1 1 1 0 1 3
## 10 1 1 1 1 1 1
## 6 1 1 1 1 0 2
## 4 1 1 1 0 1 2
## 3 1 1 1 0 0 3
## 3 0 0 0 1 1 8
## 3 0 0 0 0 1 9
## 3 0 0 0 0 0 10
## 3 1 1 1 0 1 4
## 1 0 0 0 1 1 10
## 2 0 0 0 0 1 11
## 2 1 1 1 1 0 2
## 1 0 0 0 1 0 9
## 11 1 1 1 1 0 4
## 12 1 1 1 0 0 5
## 2 0 0 0 1 0 11
## 3 0 0 0 0 0 12
## 1 1 1 1 0 0 6
## 1 0 0 0 1 0 12
## 1 0 0 0 0 0 13
## 2 1 1 1 1 1 2
## 4 1 1 1 1 0 3
## 1 1 1 1 0 1 3
## 2 1 1 1 0 0 4
## 1 0 0 0 1 0 10
## 1 1 1 1 0 1 6
## [ reached getOption("max.print") -- omitted 24 rows ]
The function md.pattern() from the package mice gives us a matrix in which each row represents one missing data pattern (1 = observed, 0 = missing). The row names show how many rows of the dataset have the given pattern. The last column shows the number of missing values in each pattern; the last row shows the number of missing values per variable.
md.pattern() also plots the missing data pattern automatically.
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 11 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 20 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 23 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 24 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 26 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 27 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 28 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 30 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 31 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 33 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 35 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 36 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 37 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 38 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 39 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 40 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## creat uricacid bili alc HyperMed Npat
## 1 1 1 1 1 0 527
## 2 1 1 1 0 0 159
## 3 1 1 1 0 1 75
## 4 1 1 1 1 1 33
## 5 0 0 0 1 0 21
## 6 0 0 0 1 1 15
## 7 0 0 0 0 0 12
## 8 1 1 1 0 0 12
## 9 1 1 1 1 0 11
## 10 1 1 1 1 1 10
## 11 1 1 1 1 1 10
## 12 1 1 1 1 0 9
## 13 0 0 0 0 1 8
## 14 1 1 1 1 0 6
## 15 1 1 1 0 1 4
## 16 1 1 1 1 1 4
## 17 1 1 1 1 0 4
## 18 1 1 1 1 0 4
## 19 0 0 0 1 1 3
## 20 1 1 1 0 0 3
## 21 1 1 1 0 1 3
## 22 0 0 0 0 0 3
## 23 1 1 1 0 1 3
## 24 1 1 1 1 1 3
## 25 0 0 0 1 0 3
## 26 0 0 0 0 1 3
## 27 0 0 0 0 0 3
## 28 1 1 1 0 0 3
## 29 0 0 0 0 1 2
## 30 0 0 0 0 1 2
## 31 0 0 0 1 0 2
## 32 1 1 0 1 0 2
## 33 1 1 1 0 1 2
## 34 0 0 0 0 0 2
## 35 1 1 1 0 0 2
## 36 1 1 1 1 0 2
## 37 0 0 0 0 1 2
## 38 1 1 1 1 1 2
## 39 1 1 1 0 0 2
## 40 0 0 0 0 0 1
## [ reached getOption("max.print") -- omitted 24 rows ]
The function md_pattern() from package JointAI gives a matrix very similar to the one obtained from mice::md.pattern(). Here, however, the number of rows in the data that have a particular missing data pattern is given in the last column.
For more information on how to customize the visualization by md_pattern() see the vignette Visualizing Incomplete Data on the webpage of JointAI.
aggr() from VIM plots a histogram of the proportion of missing values per column as well as a visualization of the missing data pattern. Here, a small histogram on the right edge of the plot shows the proportion of cases in each pattern.
For more options how to customize the plot, see the VIM documentation.
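Sketches of the VIM and visdat calls (the argument values here are illustrative, not taken from the solutions):

```r
library(VIM)
aggr(EP16dat1, prop = TRUE, numbers = FALSE)  # proportions of missing values + pattern

library(visdat)
vis_dat(EP16dat1)   # variable types and missingness
vis_miss(EP16dat1)  # missingness only
```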
Now, get an overview of the number and proportion of complete cases, and of the number and proportion of missing values per variable.
The following functions are useful here:
complete.cases() can be applied to a data.frame and returns a vector that is TRUE if the row does not have missing values and FALSE if it does.
is.na() returns TRUE if the supplied value is missing and FALSE if it is observed.
colSums() calculates the sum of the values in each column of a data.frame or matrix.
colMeans() calculates the mean of the values in each column of a data.frame or matrix.
# number and proportion of complete cases
cbind(
  "#" = table(ifelse(complete.cases(EP16dat1), 'complete', 'incomplete')),
  "%" = round(100 * table(!complete.cases(EP16dat1))/nrow(EP16dat1), 2)
)

##              #    %
## complete    33  3.3
## incomplete 967 96.7

# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(EP16dat1))),
      "% NA" = round(sort(colMeans(is.na(EP16dat1))) * 100, 2))

##          # NA % NA
## race 0 0.0
## DM 0 0.0
## gender 0 0.0
## cohort 0 0.0
## age 0 0.0
## educ 1 0.1
## smoke 2 0.2
## occup 17 1.7
## wgt 22 2.2
## hgt 22 2.2
## BMI 37 3.7
## hypten 42 4.2
## WC 49 4.9
## DBP 59 5.9
## SBP 59 5.9
## HDL 86 8.6
## chol 86 8.6
## hypchol 86 8.6
## albu 91 9.1
## creat 91 9.1
## uricacid 92 9.2
## bili 95 9.5
## alc 319 31.9
## HyperMed 813 81.3
In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.
Our data contains hgt (height), wgt (weight) and BMI. Check the missing data pattern specifically for those three variables.
There are multiple ways in which you could approach this.
We could obtain the missing data pattern for only the three variables of interest:
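For example with JointAI (argument names as in version 0.6.1; see ?md_pattern):

```r
JointAI::md_pattern(EP16dat1[, c("hgt", "wgt", "BMI")],
                    pattern = TRUE,  # return the pattern matrix
                    plot = FALSE)
```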
## hgt wgt BMI Npat
## 1 1 1 1 963
## 2 1 0 0 15
## 3 0 1 0 15
## 4 0 0 0 7
## Nmis 22 22 37 81
Alternatively, we could look at a 3-dimensional table of the missing data indicator for the three variables:
with(EP16dat1,
     table(ifelse(is.na(hgt), 'height mis.', 'height obs.'),
           ifelse(is.na(wgt), 'weight mis.', 'weight obs.'),
           ifelse(is.na(BMI), 'BMI mis.', 'BMI obs.'))
)

## , ,  = BMI mis.
##
##
## weight mis. weight obs.
## height mis. 7 15
## height obs. 15 0
##
## , , = BMI obs.
##
##
## weight mis. weight obs.
## height mis. 0 0
## height obs. 0 963
We could also use a table in long format to describe the missing data pattern:
plyr::ddply(EP16dat1, c(height = "ifelse(is.na(hgt), 'missing', 'observed')",
                        weight = "ifelse(is.na(wgt), 'missing', 'observed')",
                        BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
            plyr::summarize,
            N = length(hgt)
)

##     height   weight      BMI   N
## 1 missing missing missing 7
## 2 missing observed missing 15
## 3 observed missing missing 15
## 4 observed observed observed 963
As we have also seen in the lecture, there are some cases where only either hgt or wgt is missing. BMI is only observed when both components are observed. To use all available information, we want to impute hgt and wgt separately and calculate BMI from the imputed values.
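In mice, this strategy can be implemented via passive imputation; a sketch (not part of the original solutions; the values for maxit and seed are chosen arbitrarily):

```r
library(mice)
imp0 <- mice(EP16dat1, maxit = 0)    # dry run to obtain the default settings
meth <- imp0$method
meth["BMI"] <- "~ I(wgt / hgt^2)"    # derive BMI from the imputed hgt and wgt
pred <- imp0$predictorMatrix
pred[c("hgt", "wgt"), "BMI"] <- 0    # do not use BMI to impute its own components
imp <- mice(EP16dat1, method = meth, predictorMatrix = pred,
            maxit = 10, seed = 2020)
```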
The data contains variables on hypertension (hypten) and the use of medication to treat hypertension (HyperMed). We might expect that medication is only prescribed for patients with hypertension, hence, we need to investigate the relationship between HyperMed and hypten.
To also display the missing values in the output of table(), you need to specify the argument exclude = NULL.
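For example:

```r
# cross-tabulate HyperMed and hypten, counting NA values as a separate category
with(EP16dat1, table(HyperMed, hypten, exclude = NULL))
```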
## hypten
## HyperMed no yes <NA>
## no 0 25 0
## previous 0 20 0
## yes 0 142 0
## <NA> 693 78 42
Furthermore, we can expect a systematic relationship between the variables hypchol (hypercholesterolemia) and chol (serum cholesterol).
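One way to visualize this relationship (the figure itself is not reproduced here; ggplot2 is used as an example):

```r
library(ggplot2)
# boxplots of serum cholesterol per level of hypercholesterolemia
ggplot(EP16dat1, aes(x = hypchol, y = chol)) +
  geom_boxplot()
```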
Based on this plot it seems that hypchol is yes if chol is above some threshold. To be sure, and to find out what this threshold may be, we need to look at the numbers:
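One way to obtain the two group-wise summaries shown below:

```r
# summary of chol for participants without and with hypercholesterolemia
with(EP16dat1, summary(chol[hypchol == "no"]))
with(EP16dat1, summary(chol[hypchol == "yes"]))
```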
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.07 4.22 4.78 4.75 5.38 6.18 86
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.21 6.39 6.70 6.95 7.16 10.68 86
Indeed, there is no overlap between values of chol for participants in the two groups defined by hypchol. The cut-off for hypchol is probably 6.2.
Visualize the correlations between variables to detect associations that you may have missed.
# convert all variables to numeric
dat_num <- sapply(EP16dat1, as.numeric)
# calculate the Spearman correlation matrix
cormat <- cor(dat_num, use = 'pair', method = 'spearman')
# plot the correlation matrix
corrplot::corrplot(cormat, method = 'square', type = 'lower')

Note: Correlations involving categorical variables are not usually calculated! We only use them here as a quick-and-dirty way to visualize relationships.
The question marks in the plot indicate that no correlation could be calculated for cohort (because cohort has the same value for all subjects), nor between HyperMed and hypten (but we already knew that from the lecture).
Besides the relations we were already aware of, the plot shows us some additional potentially relevant correlations, for example, that wgt is strongly correlated with WC.
Comparing the cases where only one of wgt and WC is missing shows that there are 18 cases where wgt is missing but WC is observed, and 45 cases where WC is missing but wgt is observed.
with(EP16dat1, table(ifelse(is.na(WC), 'WC mis.', 'WC obs.'),
                     ifelse(is.na(wgt), 'wgt mis.', 'wgt obs.')
))

##
## wgt mis. wgt obs.
## WC mis. 4 45
## WC obs. 18 933
Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and categorical variables. The package JointAI provides the convenient function plot_all(), but you can of course use functions from other packages as well.
Pay attention to the shapes of the distributions, for example whether continuous variables are skewed and whether some categories of the categorical variables are only sparsely filled.
Note: When you use RStudio and the “Plots” panel is too small, you may not be able to show the distributions of all variables. You can try to enlarge the panel, or tell R to plot the graph in a new window by first opening a new graphics device, e.g. with windows() (on Windows) or quartz() (on macOS).
It can also help to reduce the white space around the separate plots by changing the margins in the graphical parameters: par(mar = c(3, 3, 2, 1))
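A sketch of the call (the values for breaks and ncol are illustrative, not taken from the solutions):

```r
library(JointAI)
par(mar = c(3, 3, 2, 1))              # reduce the white space around the panels
plot_all(EP16dat1, breaks = 30, ncol = 5)
```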
To learn more about additional options of plot_all() check out the vignette Visualizing Incomplete Data on the webpage of JointAI.
© Nicole Erler